Optimize Sequence packing to avoid table lookup in write loop #234
…ression

Packed the offset slot into the unused high bits of the length field in the `Sequence` struct. This allows the offset slot to be pre-calculated during the match finding phase, avoiding a random access to the 32KB `OFFSET_SLOT_TABLE` during the critical bitstream writing phase.

- Modified `Sequence` struct to provide helper methods for packing/unpacking.
- Updated `decide_greedy_sequences`, `compress_near_optimal_block`, and `compress_greedy_block` to pack the offset slot.
- Updated `write_match_fast` and `write_match` to use the pre-calculated offset slot.

Co-authored-by: 404Setup <[email protected]>
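For readers without the diff open, here is a minimal sketch of what such packing helpers could look like. It is not the code from this PR: the field name `packed_length`, the constants, the constructor, and the exact bit layout (length in the low 9 bits, slot directly above it) are assumptions made for illustration.

```rust
// Hypothetical layout (an assumption, not taken from the PR):
//   bits 0..=8  -> match length (values up to 258 need 9 bits)
//   bits 9..=13 -> offset slot  (values 0..=29 need 5 bits)
const LENGTH_BITS: u32 = 9;
const LENGTH_MASK: u16 = (1 << LENGTH_BITS) - 1; // 0x01FF

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Sequence {
    packed_length: u16, // length and offset slot share this field
    offset: u16,        // match offset
}

impl Sequence {
    /// Packs `off_slot` into the high bits above `length`.
    fn new(length: u16, offset: u16, off_slot: u8) -> Self {
        debug_assert!(length <= 258);
        debug_assert!(off_slot < 30);
        Sequence {
            packed_length: length | ((off_slot as u16) << LENGTH_BITS),
            offset,
        }
    }

    /// Match length, with the slot bits masked off.
    fn length(self) -> u16 {
        self.packed_length & LENGTH_MASK
    }

    /// Pre-calculated offset slot, read back from the high bits.
    fn off_slot(self) -> u8 {
        (self.packed_length >> LENGTH_BITS) as u8
    }
}
```

With a layout like this, a writer only needs `seq.length()` and `seq.off_slot()`, so the slot never has to be looked up again from the offset.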
Optimized the `Sequence` struct packing in `src/compress/mod.rs` to store the offset slot (`off_slot`) alongside the match length. The `length` field (`u16`) only uses 9 bits for the maximum match length of 258, leaving 7 bits free. The offset slot (0-29) fits in 5 bits, so it is packed into the high bits of `length`.

This change allows the offset slot to be calculated once during the match finding phase (where it is already needed for frequency statistics) and stored in the sequence. During the bitstream writing phase (specifically in `write_match` and `write_match_fast`), the code now uses this pre-calculated value instead of performing a lookup in the `OFFSET_SLOT_TABLE` (32KB). This reduces cache pressure in the hot write loop.

Verified with `cargo test` and manual benchmarking. While simple microbenchmarks showed neutral results (likely due to cache locality in the test case), this change theoretically improves performance for more complex workloads by reducing the working set size during bitstream generation.

PR created automatically by Jules for task 16937576356832658656 started by @404Setup
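The compute-once, reuse-later flow described above can be sketched as follows. This is an illustration, not the PR's code. It assumes the standard DEFLATE distance-code layout (30 slots covering offsets 1 to 32768) and derives the slot arithmetically as a stand-in for `OFFSET_SLOT_TABLE`. The function names `offset_slot`, `record_match`, and `unpack`, and the shift of 9, are hypothetical.

```rust
/// Offset slot (0..=29) for an offset in 1..=32768, assuming the standard
/// DEFLATE distance-code layout. Stand-in for a table lookup.
fn offset_slot(offset: u32) -> u8 {
    debug_assert!((1..=32768).contains(&offset));
    let d = offset - 1;
    if d < 4 {
        // Offsets 1..=4 map directly to slots 0..=3.
        return d as u8;
    }
    // Index of the highest set bit of d.
    let log2 = 31 - d.leading_zeros();
    // Two slots per power of two; the bit below the top bit picks which.
    (2 * log2 + ((d >> (log2 - 1)) & 1)) as u8
}

/// Match-finding side: the slot is needed here anyway for the frequency
/// counts, so it is stored in the spare high bits of the 16-bit length.
fn record_match(length: u16, offset: u32, slot_freqs: &mut [u32; 30]) -> u16 {
    let slot = offset_slot(offset);
    slot_freqs[slot as usize] += 1;
    length | ((slot as u16) << 9)
}

/// Write side: recover (length, slot) from the packed field alone,
/// without touching any lookup table.
fn unpack(packed: u16) -> (u16, u8) {
    (packed & 0x01FF, (packed >> 9) as u8)
}
```

The write side here touches only the 16-bit field it is already reading, which is the working-set reduction the description refers to. Whether that shows up in benchmarks depends on the workload, as noted above.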